Research-Agent

Concurrent search, deduplication, URL content analysis, and structured AI report synthesis.

Track 01 · Pipeline. The research engine responsible for fetching, filtering, and synthesizing web content for a 6-agent content pipeline. It runs concurrent lookups across SerpAPI, Tavily, Brave, and DuckDuckGo, strips boilerplate navigation from raw pages, ranks sources by quality, and writes structured ResearchReports via Anthropic, OpenAI, or Groq. Extracted from the production Agentic OS.

Open source github.com ↗
Track: Track 01 · Concurrent Pipeline
Runtime: Python 3.10+ (pip install)
Search APIs: SerpAPI, Tavily, Brave Search, DuckDuckGo (free)
Tests: 9 tests covering search, URL fetching, ranking, and API routers

Research Pipeline: Annotated Reference

Query search engines concurrently, deduplicate URLs, analyze body content, and write structured research reports.

The problem

Standard LLMs cannot browse the live web without a search tool, but simply hooking an LLM to a search API is slow and noisy. If you ask an agent to research a topic, and it executes a single search query, reads one result, and writes a response, it inherits the bias and limits of that single source.

To generate high-quality research, you need a system that queries multiple search indexes (SerpAPI, Brave, Tavily) in parallel, filters out advertising, navigation headers, and duplicate links, reads and scores the content of 5+ web pages simultaneously, and feeds those structured insights into a final synthesis model.

How it works: step by step

  • Step 1: Concurrent Search Querying. The user inputs a topic and descriptive keywords. The system dispatches search queries concurrently across SerpAPI (Google), Tavily, Brave Search, and DuckDuckGo (which runs free without an API key). Running queries in parallel prevents any single slow API from blocking the pipeline.
  • Step 2: URL Deduplication & Selection. The search manager aggregates results, strips tracking parameters, and deduplicates identical URLs. It scores and ranks results based on domain authority and snippet relevance, selecting the top 5 targets for ingestion.
  • Step 3: Raw Content Scraping & Cleaning. The system fetches the HTML of the selected URLs in parallel. A parsing step strips navigation headers, sidebars, footer blocks, and script nodes, keeping only the core article text, capped at 5000 characters to prevent token bloat.
  • Step 4: LLM-Powered Source Evaluation. Each article is sent to a fast analyzer model (such as GPT-4o-mini), which returns a structured JSON assessment: quality (0-10), sentiment, and key insights. Sources that fall below a quality threshold are rejected.
  • Step 5: Report Synthesis. The high-scoring source summaries and core insights are sent to a high-capacity model (Anthropic Claude 3.5 Sonnet or OpenAI GPT-4o) via a provider-agnostic router. The model synthesizes these inputs into a formatted ResearchReport markdown file.
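
Steps 1 and 2 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the repository's code: `fake_provider`, `failing_provider`, `normalize_url`, `search_all`, and the list of tracking parameters are hypothetical, and the fake providers stand in for real SerpAPI, Tavily, Brave, and DuckDuckGo calls. Scoring and top-5 selection are left out.

```python
import asyncio
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed tracking parameters; the source does not list the real ones.
TRACKING_PREFIXES = ("utm_",)
TRACKING_KEYS = {"fbclid", "gclid", "ref"}


def normalize_url(url: str) -> str:
    """Lowercase scheme and host, drop tracking params, fragment, trailing slash."""
    parts = urlsplit(url)
    query = [
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k not in TRACKING_KEYS and not k.startswith(TRACKING_PREFIXES)
    ]
    path = parts.path.rstrip("/") or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, urlencode(query), "")
    )


async def fake_provider(name: str, delay: float, urls: list[str]) -> list[dict]:
    """Hypothetical stand-in for one search API call."""
    await asyncio.sleep(delay)
    return [{"provider": name, "url": u} for u in urls]


async def failing_provider() -> list[dict]:
    """Hypothetical provider that always fails, to show error isolation."""
    raise TimeoutError("simulated provider outage")


async def search_all(provider_calls) -> list[dict]:
    # return_exceptions=True keeps one failing provider from sinking the batch.
    batches = await asyncio.gather(*provider_calls, return_exceptions=True)
    seen: set[str] = set()
    merged: list[dict] = []
    for batch in batches:
        if isinstance(batch, BaseException):
            continue  # skip providers that raised
        for hit in batch:
            key = normalize_url(hit["url"])
            if key not in seen:
                seen.add(key)
                merged.append({**hit, "url": key})
    return merged


def demo() -> list[dict]:
    return asyncio.run(
        search_all(
            [
                fake_provider(
                    "serpapi",
                    0.02,
                    ["https://Example.com/a?utm_source=x", "https://example.com/b"],
                ),
                fake_provider(
                    "brave",
                    0.01,
                    ["https://example.com/a/", "https://example.com/c?id=7"],
                ),
                failing_provider(),
            ]
        )
    )
```

Because `asyncio.gather` returns results in the order the calls were passed in, the first provider listed wins when two providers return the same normalized URL, regardless of which one responded first.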

Architecture and File Structure

  • src/llm/router.py: Handlers for multi-provider API calls. Manages key loading, default parameters, and exception-driven fallback between Anthropic, OpenAI, and Groq.
  • src/search/providers.py: Implementations for SerpAPI, Tavily, Brave Search, and DuckDuckGo. Uses Python's asyncio to fire queries concurrently and merges results into unified data objects.
  • src/content/analyzer.py: The scraper and parser module. Extracts text using Beautiful Soup selectors, strips layout noise, and runs source analysis prompts.
  • src/agents/research_agent.py: Orchestrates the search, crawl, evaluation, and synthesis pipeline. Converts parameters into a final report.
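
The cleaning pass described for the analyzer module might look something like the sketch below. It is an assumption-laden stand-in: it uses the standard library's `html.parser` instead of Beautiful Soup so that it runs without extra dependencies, and `ArticleText`, `extract_text`, and the `NOISE_TAGS` set are hypothetical names, not the project's selectors. Only the 5000-character cap comes from the description above.

```python
from html.parser import HTMLParser

# Assumed set of tags whose whole subtree counts as layout noise.
NOISE_TAGS = {"nav", "header", "footer", "aside", "script", "style"}
MAX_CHARS = 5000  # cap stated in Step 3


class ArticleText(HTMLParser):
    """Collects visible text while skipping anything inside a noise tag."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a noise subtree
        self._chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self._skip_depth == 0 and text:
            self._chunks.append(text)

    def text(self) -> str:
        # Join the kept fragments and enforce the character cap.
        return " ".join(self._chunks)[:MAX_CHARS]


def extract_text(html: str) -> str:
    parser = ArticleText()
    parser.feed(html)
    parser.close()
    return parser.text()
```

A tag-based filter like this is deliberately blunt: a `header` element nested inside an article is dropped along with the site header, which is one reason a selector-based approach may be preferable in practice.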

Provider Cascades (LLMs & Search)

The system evaluates services dynamically, shifting load when APIs rate-limit or fail:

LLM Providers (priority order)

  1. Anthropic Claude: best synthesis output, paid
  2. OpenAI GPT: high consistency, paid
  3. Groq: fast response, free tier
  4. Google Gemini: fallback, free tier

Search Providers (concurrent execution)

  • SerpAPI: Google index search, requires key
  • Tavily: AI search specialist, requires key
  • Brave Search: independent index, requires key
  • DuckDuckGo: free backup, runs out-of-the-box
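
A priority-order LLM cascade of this kind can be sketched as a loop that tries each provider and falls through on failure. The names here (`complete_with_fallback`, `AllProvidersFailed`, `rate_limited`, `echo`) are hypothetical, and the callables are stubs rather than real Anthropic, OpenAI, or Groq clients; the actual router may well behave differently.

```python
from typing import Callable, Sequence


class AllProvidersFailed(RuntimeError):
    """Raised when every provider in the cascade has failed."""


def complete_with_fallback(
    prompt: str,
    providers: Sequence[tuple[str, Callable[[str], str]]],
) -> tuple[str, str]:
    """Try providers in priority order; return (provider_name, completion)."""
    errors: list[str] = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:
            # A real router would likely catch narrower, provider-specific errors.
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


def rate_limited(prompt: str) -> str:
    """Stub provider that simulates a rate-limit failure."""
    raise RuntimeError("429 rate limited")


def echo(prompt: str) -> str:
    """Stub provider that always succeeds."""
    return f"report for: {prompt}"
```

For example, `complete_with_fallback("agent memory", [("anthropic", rate_limited), ("openai", echo)])` skips the failing first entry and returns the result from the second, along with its name so the caller can log which provider answered.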

How to run it

git clone https://github.com/shubham0086/research-agent
cd research-agent
pip install -r requirements.txt
cp .env.example .env

# Run zero-key mode (uses DDG Search + mock LLM output)
python demo/run.py

# Run full research pipeline (requires keys)
python demo/run.py --topic "autonomous agent memory" --depth deep

Where this fits

Research Agent represents the **concurrent data harvesting** pipeline. It is Stage 2 of the Agentic OS content production line, running immediately after client onboarding:
Brief Intake → [Research Agent] → Content Strategist → Creator → QA → Formatter

The structured data emitted by the Research Agent is fed directly into the Content Strategist to define the outlines of new projects.

Honest framing

The research report is only as good as the search queries you provide and the domains the ranker favors. A vague query retrieves generic articles, and the resulting LLM synthesis reads like a generic Wikipedia summary. To mitigate this, we recommend running a "Keyword Expansion" step beforehand so that the Search Manager receives highly specific keywords rather than a single sentence. Furthermore, the scraper only fetches HTML pages; it does not parse complex media files or locked PDFs.